{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Enron Dataset Preprocessing : Format the dataset into time series\n",
    "\n",
    "In this notebook, we format the parsed dataset into a multivariate time series."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import sys\n",
    "import time\n",
    "from datetime import datetime\n",
    "from collections import defaultdict\n",
    "from collections import Counter\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Load the parsed dataset\n",
    "\n",
    "The parsed dataset should first be generated from the previous notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('enron_dataset_raw.csv', encoding='utf-8', index_col=0)\n",
    "df.shape # Shape: (517401, 13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['user', 'From', 'To', 'X-Origin', 'X-From', 'X-To', 'date', 'timestamp', 'file']].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Preprocess the dataframe\n",
    "\n",
    "### Step 2.1: Basic cleaning & filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deal with `Na` values. I.e. Drop columns with mostly `Na` and drop rows with remaining `Na`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop columns Cc, Bcc, X-cc, X-bcc which are mostly Na\n",
    "df = df.drop(columns=['Cc', 'Bcc', 'X-cc', 'X-bcc'])\n",
    "\n",
    "# Remove rows with missing values is empty strings\n",
    "df = df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape # Shape: (495547, 9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply some basic manual fixes to some fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "df = df.rename(columns={'user': 'user_id'})\n",
    "\n",
    "# Get the folder of the email from the `file` field\n",
    "df['folder'] = df.file.apply(lambda fpath: fpath.split('/')[-2])\n",
    "\n",
    "# Remove doublon from a user that appears to be duplicated (i.e. has a single email address for two different usernames)\n",
    "df = df[df['user_id'] != 'whalley-l']\n",
    "\n",
    "# Remove 'Taylor, Michael E' mixed up with 'Mark Taylor' in user 'taylor-m' that appears to be duplicated (i.e. has a single email address for two different usernames)\n",
    "df = df[~df['X-From'].apply(lambda s: 'Taylor, Michael E' in s)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filter mails from/to addresses outside of enron."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask_from_enron = df['From'].apply(lambda s: '@enron.com' in s)\n",
    "mask_to_enron = df['To'].apply(lambda s: '@enron.com' in s)\n",
    "df = df[mask_to_enron & mask_from_enron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filter users with not enough mails sent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimum number of emails per use to pass filter\n",
    "MIN_MAIL_COUNT = 50\n",
    "\n",
    "df_user_count = df['user_id'].value_counts() \n",
    "users_to_keep = set(df_user_count[df_user_count > MIN_MAIL_COUNT].index)\n",
    "mask_has_enough_mails = df['user_id'].apply(lambda u: u in users_to_keep)\n",
    "df = df[mask_has_enough_mails]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate unique index for each email."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset the index\n",
    "df = df.reset_index(drop=True)\n",
    "\n",
    "# Generate a unique `id` for each email before we duplicated them  by splitting the `To` fields\n",
    "df['mail_id'] = df.index.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualize a few examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.shape) # Shape: (347051, 11)\n",
    "df.head(3) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2.2: Identify senders in inbox folders\n",
    "\n",
    "#### Match user ids with X-From fields\n",
    "\n",
    "Some emails occur in the inbox of some users but the sent emails of their senders. To augment the dataset with these emails, we match the users with their `X-From` field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def clean_xname_str(s):\n",
    "    return [re.sub(' <.+>', '', re.sub('\\n\\t', '', addr)) for addr in s.split(', ')]\n",
    "\n",
    "def get_user_sent_mask(user):\n",
    "    return (df.user_id == user) & pd.concat([(df.folder == folder) for folder in ['sent', 'sent_items', '_sent_mail']], axis=1).any(axis=1)\n",
    "\n",
    "def clean_match(match):\n",
    "    match_dict = match.groupdict()\n",
    "    if match_dict['email']:\n",
    "        return match_dict['email'].strip(\"'\")\n",
    "    elif match_dict['name']:\n",
    "        return re.sub(' <.+>', '', match_dict['name']).strip().lower()\n",
    "    elif match_dict['single_name']:\n",
    "        return match_dict['single_name'].strip().lower()\n",
    "    else:\n",
    "        raise ValueError('Invalid match')\n",
    "    \n",
    "re_is_xname_str = r\"(?P<name>[\\w. ]+, [\\w. ]+( \\([\\w. ]+\\))? <[^>]+>)\"\n",
    "re_is_mail_str = r\"(?P<email>'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+')\"\n",
    "re_is_single_name_str = r\"^(?P<single_name>[\\w. ]+(, [\\w. ]+)?)$\"\n",
    "reg_xname = re.compile('|'.join([re_is_xname_str, re_is_mail_str, re_is_single_name_str]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xfrom_user_count_dict = {}\n",
    "for user in df.user_id.unique():\n",
    "    user_sent_mask = get_user_sent_mask(user)\n",
    "    if sum(user_sent_mask) > 0:\n",
    "        arr = np.hstack([list(map(clean_match, reg_xname.finditer(xfrom_str))) for xfrom_str in df[user_sent_mask]['X-From'].values])\n",
    "        xfrom_user_count_dict[user] = sorted(Counter(arr).items(), key=lambda e: e[1], reverse=True)\n",
    "    else:\n",
    "        print('Error on user', user)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filtered_xfrom_user_dict = {}\n",
    "\n",
    "# Filter names that does not contain the lastname of the user\n",
    "for user, name_list in xfrom_user_count_dict.items():\n",
    "    if len(name_list) > 1:\n",
    "        cleaned_name_list = []\n",
    "        for name,_ in name_list:\n",
    "            lastname = user.split('-')[0]\n",
    "            if lastname in name:\n",
    "                cleaned_name_list.append(name)\n",
    "    else:\n",
    "        cleaned_name_list = [name_list[0][0]]\n",
    "    filtered_xfrom_user_dict[user] = cleaned_name_list\n",
    "\n",
    "# Add inversions of lastname/firstname to all users\n",
    "for user, name_list in filtered_xfrom_user_dict.items():\n",
    "    new_name_set = set(name_list)\n",
    "    for name in name_list:\n",
    "        if ',' in name:\n",
    "            # For all `lastname, firstname`, add `firstname lastname`\n",
    "            lname, fname = name.split(',')\n",
    "            new_name = fname.strip() + ' ' + lname.strip()\n",
    "            new_name_set.add(new_name)\n",
    "        else:\n",
    "            # For all `firstname lastname`, add `lastname, firstname`\n",
    "            names = name.split(' ')\n",
    "            new_name = names[-1].strip() + ', ' + ' '.join(names[:-1]).strip()\n",
    "            new_name_set.add(new_name)\n",
    "    filtered_xfrom_user_dict[user] = list(new_name_set)\n",
    "\n",
    "# Manual editions for bogus cases\n",
    "filtered_xfrom_user_dict['hodge-j'] = ['hodge, john', 'hodge, jeffrey t.', 'hodge, jeffrey t', 'jeffrey t. hodge']\n",
    "filtered_xfrom_user_dict['crandell-s'] = ['crandall, sean', 'sean crandall']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a dataframe to count the number of names by user, and manually check the users with the largest number of names to investigate potential bogus cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter = list(map(lambda e: (e[0], len(e[1])), filtered_xfrom_user_dict.items()))\n",
    "df_xfrom_count = pd.DataFrame(counter, columns=['user_id', 'count'])\n",
    "df_xfrom_count.sort_values(by='count').iloc[-5:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Match user ids with X-To fields"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_user_inbox_mask(user):\n",
    "    return (df.user_id == user) & (df.folder == 'inbox')\n",
    "\n",
    "xto_user_dict = {}\n",
    "for user in df.user_id.unique():\n",
    "    user_inbox_mask = get_user_inbox_mask(user)\n",
    "    if sum(user_inbox_mask) > 0:\n",
    "        arr = np.hstack([list(map(clean_match, reg_xname.finditer(xto_str))) for xto_str in df[user_inbox_mask]['X-To'].values])\n",
    "        xto_user_dict[user] = set(arr)\n",
    "    else:\n",
    "        print('Error on user', user)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a dataframe to count the number of names of a user shared both in sent and inbox messages, and check that no intersection is null."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_xto_count = pd.DataFrame([(user, set(from_name_list).intersection(set(xto_user_dict[user]))) for user, from_name_list in filtered_xfrom_user_dict.items() if user in xto_user_dict],\n",
    "                            columns=['user_id', 'xfrom_xto_intersection'])\n",
    "df_xto_count['count'] = df_xto_count.xfrom_xto_intersection.apply(len)\n",
    "\n",
    "df_xto_count.sort_values(by='count', ascending=False).tail(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the mapping `X-To`/`X-From` name to `user_id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xname_to_userid_mapping = {name: user for user, name_list in filtered_xfrom_user_dict.items() for name in name_list}\n",
    "\n",
    "# Check that it has the same size of filtered_xfrom_user_dict (i.e. check that no names were duplicated and overriden in the dict)\n",
    "assert sum(map(len, filtered_xfrom_user_dict.values())) == len(xname_to_userid_mapping)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_sender_id(xfrom_field):\n",
    "    match = reg_xname.search(xfrom_field)\n",
    "    if match:\n",
    "        return xname_to_userid_mapping.get(clean_match(match))\n",
    "    else:\n",
    "        return None\n",
    "\n",
    "def extract_receiver_id_list(xto_field):\n",
    "    receiver_id_list = list()\n",
    "    for match in reg_xname.finditer(xto_field):\n",
    "        if match:\n",
    "            receiver_id_list.append(xname_to_userid_mapping.get(clean_match(match)))\n",
    "    return receiver_id_list\n",
    "\n",
    "df['sender_id'] = df['X-From'].apply(lambda s: extract_sender_id(s))\n",
    "df['receiver_id_list'] = df['X-To'].apply(lambda s: extract_receiver_id_list(s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2.3: Build the timeseries dataframe\n",
    "\n",
    "We now format the resulting dataframe into times series of sent emails per user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_timeseries = df[~df.sender_id.isnull()][['sender_id', 'timestamp']].groupby('sender_id')['timestamp'].apply(list).to_frame()\n",
    "df_timeseries = df_timeseries.rename(columns={'timestamp': 'timestamps'})\n",
    "\n",
    "# Remove emails before January 1, 2000\n",
    "min_timestamp = time.mktime(datetime(2000,1,1).timetuple())\n",
    "df_timeseries['timestamps'] = df_timeseries['timestamps'].apply(np.array)\n",
    "df_timeseries['timestamps'] = df_timeseries['timestamps'].apply(np.unique)\n",
    "df_timeseries['timestamps'] = df_timeseries['timestamps'].apply(lambda events_m: events_m[events_m > min_timestamp])\n",
    "\n",
    "# Add the count of events per user\n",
    "df_timeseries['num_events_m'] = df_timeseries['timestamps'].apply(len)\n",
    "\n",
    "# Filter out users with less than 10 events\n",
    "df_timeseries = df_timeseries[df_timeseries.num_events_m > 10]\n",
    "\n",
    "df_timeseries.num_events_m.sum() # Output: 74294"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save time series to json file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_timeseries.to_json('enron_dataset_timeseries.json')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
